Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Fri Dec 04 00:38:14 2020"

I feel that this course will be a challenging experience for me, since I am a beginner in R. I would like to learn something that I can use in my own studies and research to improve my working efficiency. I selected this course on WebOodi and was attracted by its course description. The link to my GitHub repository: https://github.com/Liyuan202/IODS-project


Chapter 2 Regression and model validation

1. Read the data with read.table() from "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt". The commands dim() (dimensions) and str() (structure) show the size and structure of the table: it has 166 observations and 7 variables, namely "gender", "age", "attitude", "deep", "stra", "surf" and "points". The first six are explanatory variables and the last one (points) is the dependent variable.

lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = TRUE)
 dim(lrn14)
## [1] 166   7
 str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

2. Load the ggplot2 library with the command library(ggplot2); note that ggplot2 has to be installed first with install.packages("ggplot2"). The graphical overview shows a positive relationship: exam points increase as attitude increases, with the observations scattered fairly evenly around the fitted line.

library(ggplot2)
p1<-ggplot(lrn14,aes(x=attitude,y=points))
p2<-p1+geom_point()
p2

p3<-p2+geom_smooth(method="lm")
p4<-p3+ggtitle("student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'

3. The summary of the model: the residuals range from -17.52 (Min) to 11.51 (Max). Attitude (p = 4.44e-09) has a statistically significant relationship with points, while the p-values of the other explanatory variables (stra and deep) are above 0.05, so they do not have a significant relationship with the target variable points. The model was therefore refitted using attitude alone.

my_model <- lm(points~attitude+stra+deep, data =lrn14)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + deep, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## stra          0.9621     0.5367   1.793  0.07489 .  
## deep         -0.7492     0.7507  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08
my_model2 <- lm(points~attitude, data =lrn14)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
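As a worked example of reading these coefficients, the fitted line of my_model2 (points = 11.64 + 3.53 * attitude) can be used to predict exam points; the attitude score of 4 below is a made-up illustration value, not an observation from the data.

```r
# coefficient estimates of my_model2, copied from the summary above
b0 <- 11.6372  # intercept
b1 <- 3.5255   # slope of attitude

# predicted exam points for a hypothetical student with attitude score 4
b0 + b1 * 4  # 25.7392
```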

4. For model validation, the diagnostic plots support the model assumptions: the Residuals vs Fitted plot shows no clear pattern (constant error variance), the Normal Q-Q plot shows that the errors are approximately normally distributed, and the Residuals vs Leverage plot shows no single observation with undue influence.

par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

It is really challenging for me!!


Chapter 3 Logistic regression

Read the data

Description: this is the student alcohol consumption dataset, which has 382 observations and 35 variables. I will select four of the variables for my analysis.

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt" 
alc = read.csv(url, sep=",")
dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Analysis of the relationship between high alcohol consumption and age, failures, absences and sex

The four variables I selected: age, failures, absences and sex

Hypothesis: age, sex, failures and absences can all predict high alcohol consumption.

Numerically and graphically explore the distribution of the data

Comments: the cross tabulations and charts show the distributions of age, failures, absences and sex for high/low alcohol use. Age 17 contains the largest number of high users (35), although the difference from the other age groups is small; males are more likely to be high users than females; most high users have 0 failures and few absences, but this mainly reflects the fact that those groups are the largest overall. The regression results below partly confirm the hypothesis: only failures, absences and sex have a statistically significant relationship with alcohol consumption.

Numerical exploration: model summary, data summary and cross tabulations

malc <- glm(high_use ~ age + failures + absences + sex, data = alc, family = "binomial")
summary(malc)
## 
## Call:
## glm(formula = high_use ~ age + failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6644  -0.8051  -0.6026   1.0478   2.1115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.82153    1.72779  -2.212   0.0270 *  
## age          0.11375    0.10394   1.094   0.2738    
## failures     0.37915    0.15333   2.473   0.0134 *  
## absences     0.07059    0.01795   3.932 8.43e-05 ***
## sexM         0.98875    0.24475   4.040 5.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.44  on 377  degrees of freedom
## AIC: 427.44
## 
## Number of Fisher Scoring iterations: 4
summary(alc)
##     school              sex                 age          address         
##  Length:382         Length:382         Min.   :15.00   Length:382        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.59                     
##                                        3rd Qu.:17.00                     
##                                        Max.   :22.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:382         Length:382         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :3.000  
##                                        Mean   :2.806   Mean   :2.565  
##                                        3rd Qu.:4.000   3rd Qu.:4.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            nursery         
##  Length:382         Length:382         Length:382         Length:382        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    internet           guardian           traveltime      studytime    
##  Length:382         Length:382         Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :1.000   Median :2.000  
##                                        Mean   :1.442   Mean   :2.034  
##                                        3rd Qu.:2.000   3rd Qu.:2.000  
##                                        Max.   :4.000   Max.   :4.000  
##     failures       schoolsup            famsup              paid          
##  Min.   :0.0000   Length:382         Length:382         Length:382        
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.2906                                                           
##  3rd Qu.:0.0000                                                           
##  Max.   :3.0000                                                           
##   activities           higher            romantic             famrel    
##  Length:382         Length:382         Length:382         Min.   :1.00  
##  Class :character   Class :character   Class :character   1st Qu.:4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.00  
##                                                           Mean   :3.94  
##                                                           3rd Qu.:5.00  
##                                                           Max.   :5.00  
##     freetime         goout            Dalc            Walc          health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:3.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.223   Mean   :3.113   Mean   :1.474   Mean   :2.28   Mean   :3.579  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1              G2              G3       
##  Min.   : 0.000   Min.   : 3.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 8.25   1st Qu.: 8.00  
##  Median : 3.000   Median :10.50   Median :11.00   Median :11.00  
##  Mean   : 5.319   Mean   :10.86   Mean   :10.71   Mean   :10.39  
##  3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :75.000   Max.   :19.00   Max.   :19.00   Max.   :20.00  
##     alc_use       high_use      
##  Min.   :1.000   Mode :logical  
##  1st Qu.:1.000   FALSE:270      
##  Median :1.500   TRUE :112      
##  Mean   :1.877                  
##  3rd Qu.:2.500                  
##  Max.   :5.000
table(high_use = alc$high_use, age = alc$age)
##         age
## high_use 15 16 17 18 19 20 22
##    FALSE 63 79 65 55  7  1  0
##    TRUE  18 28 35 26  4  0  1
table(high_use = alc$high_use, sex = alc$sex)
##         sex
## high_use   F   M
##    FALSE 157 113
##    TRUE   41  71
table(high_use = alc$high_use, failures = alc$failures)
##         failures
## high_use   0   1   2   3
##    FALSE 234  22   5   9
##    TRUE   82  16   6   8
table(high_use = alc$high_use, absences = alc$absences)
##         absences
## high_use  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
##    FALSE 94  2 54  2 36  4 19  6 13  3 12  1  6  0  7  1  2  0  2  0  1  1  0
##    TRUE  22  1 13  4 15  0 11  2  8  0  5  1  4  3  4  2  4  1  2  1  1  0  3
##         absences
## high_use 23 24 25 26 28 30 54 56 75
##    FALSE  1  0  1  1  0  0  0  0  1
##    TRUE   0  1  0  0  1  1  1  1  0
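The raw counts above are easier to compare as proportions. A minimal sketch, using the counts from the high_use-by-sex cross tabulation above with prop.table(): margin = 2 gives within-column (within-sex) proportions, roughly 21% of females versus 39% of males being high users.

```r
# counts from the high_use-by-sex cross tabulation above
tab <- matrix(c(157, 41, 113, 71), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"), sex = c("F", "M")))

# margin = 2: proportions within each column, i.e. within each sex
prop.table(tab, margin = 2)
```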

Graphical exploration

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) 
malc <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")

g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
g1 + geom_bar()

alc <- mutate(alc, high_use = alc_use > 2)
g2 <- ggplot (alc, aes(high_use))
g2 + facet_wrap("age") + geom_bar()

g2 + facet_wrap("failures") + geom_bar()

g2 + facet_wrap("absences") + geom_bar()

g2 + facet_wrap("sex") + geom_bar()

library(tidyr)
glimpse(alc) 
## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
gather(alc) %>% glimpse
## Rows: 13,370
## Columns: 2
## $ key   <chr> "school", "school", "school", "school", "school", "school", "...
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "...
gather(alc) %>% ggplot(aes(value))+ facet_wrap("key", scales = "free")+ geom_bar()

Logistic regression to explore the relationship between high/low alcohol consumption and age, failures, absences and sex

Interpretation: the summary of the fitted model shows that failures, absences and sex have statistically significant relationships with high/low alcohol use (p < 0.05 for each), while age does not (p = 0.27). Exponentiating the coefficients gives odds ratios: the odds ratio of age is 1.12 (95% CI 0.91-1.38); because the interval includes 1, age is not a significant predictor. The odds ratio of failures is 1.46 (95% CI 1.08-1.98), meaning each additional past class failure multiplies the odds of high alcohol use by about 1.46. The odds ratio of absences is 1.07 (95% CI 1.04-1.11), so each additional absence increases the odds of high use by about 7%. The odds ratio of sexM is 2.69 (95% CI 1.67-4.38), so males have about 2.7 times the odds of high alcohol use compared with females. These results partly confirm the hypothesis.

malc <- glm(high_use ~ age + failures + absences + sex, data = alc, family = "binomial")
summary(malc)
## 
## Call:
## glm(formula = high_use ~ age + failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6644  -0.8051  -0.6026   1.0478   2.1115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.82153    1.72779  -2.212   0.0270 *  
## age          0.11375    0.10394   1.094   0.2738    
## failures     0.37915    0.15333   2.473   0.0134 *  
## absences     0.07059    0.01795   3.932 8.43e-05 ***
## sexM         0.98875    0.24475   4.040 5.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.44  on 377  degrees of freedom
## AIC: 427.44
## 
## Number of Fisher Scoring iterations: 4
OR <- coef (malc) %>% exp
CI <- confint(malc) %>% exp
## Waiting for profiling to be done...
cbind (OR, CI)
##                     OR        2.5 %    97.5 %
## (Intercept) 0.02189419 0.0007075463 0.6273627
## age         1.12047076 0.9145032988 1.3757107
## failures    1.46104027 1.0795866408 1.9778080
## absences    1.07314328 1.0382625002 1.1137734
## sexM        2.68786068 1.6739407174 4.3782085
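As a sanity check on the coefficients, the fitted log-odds can be turned into a probability by hand with the inverse logit. A minimal sketch using the estimates from summary(malc) above; the student profile (age 17, no failures, 4 absences, male) is a hypothetical example, not a particular row of the data.

```r
# log-odds for a hypothetical student: age 17, 0 failures, 4 absences, male,
# using the coefficient estimates printed in summary(malc) above
log_odds <- -3.82153 + 0.11375 * 17 + 0.37915 * 0 + 0.07059 * 4 + 0.98875 * 1

# inverse logit: probability of high alcohol use
prob <- exp(log_odds) / (1 + exp(log_odds))
round(prob, 2)  # about 0.35
```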

Using failures, absences and sex (the statistically significant predictors of high/low alcohol consumption) to explore the predictive power of the model with a cross tabulation of predictions versus observed values

Comments: the results show that the model predicts the target variable reasonably well, with a training accuracy of 74.35%. The predicted probabilities of the first ten participants are 0.18, 0.16, 0.50, 0.14, 0.16, 0.44, 0.28, 0.18, 0.28, 0.28. In total 12 + 86 = 98 individuals, about 25.65%, are inaccurately classified. A simple guessing strategy that always predicts low use would be correct about 70.7% of the time (270 of the 382 students are low users), so the model performs somewhat better than simple guessing.

malc <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
summary(malc)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6629  -0.8545  -0.5894   1.0339   2.0428  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.95397    0.22819  -8.563  < 2e-16 ***
## failures     0.40462    0.15024   2.693  0.00708 ** 
## absences     0.07294    0.01796   4.061 4.88e-05 ***
## sexM         0.98848    0.24453   4.042 5.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 418.64  on 378  degrees of freedom
## AIC: 426.64
## 
## Number of Fisher Scoring iterations: 4
probabilities <- predict (malc, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, sex, high_use, probability, prediction) %>% head(10)
##    failures absences sex high_use probability prediction
## 1         0        6   F    FALSE   0.1799998      FALSE
## 2         0        4   F    FALSE   0.1594640      FALSE
## 3         3       10   F     TRUE   0.4973115      FALSE
## 4         0        2   F    FALSE   0.1408685      FALSE
## 5         0        4   F    FALSE   0.1594640      FALSE
## 6         0       10   M    FALSE   0.4412412      FALSE
## 7         0        0   M    FALSE   0.2757800      FALSE
## 8         0        6   F    FALSE   0.1799998      FALSE
## 9         0        0   M    FALSE   0.2757800      FALSE
## 10        0        0   M    FALSE   0.2757800      FALSE
## 208        3        5   F    FALSE   0.4072279      FALSE
## 209        0       10   F    FALSE   0.2271275      FALSE
## 210        0        6   F     TRUE   0.1799998      FALSE
## 211        0        6   F    FALSE   0.1799998      FALSE
## 212        0       10   F    FALSE   0.2271275      FALSE
## 213        0       13   M     TRUE   0.4956709      FALSE
## 214        0        0   F    FALSE   0.1241213      FALSE
## 215        1       15   M     TRUE   0.6302227       TRUE
## 216        0       12   F    FALSE   0.2537465      FALSE
## 217        0        2   F    FALSE   0.1408685      FALSE
## 218        2       22   F     TRUE   0.6129829       TRUE
## 219        1       13   M     TRUE   0.5956324       TRUE
## 220        0        3   F     TRUE   0.1499290      FALSE
## 221        0        4   F    FALSE   0.1594640      FALSE
## 222        0        2   F    FALSE   0.1408685      FALSE
## 223        1        0   F    FALSE   0.1751799      FALSE
## 224        0        2   F    FALSE   0.1408685      FALSE
## 225        0        0   M     TRUE   0.2757800      FALSE
## 226        0        0   F    FALSE   0.1241213      FALSE
## 227        1       16   F    FALSE   0.4055561      FALSE
## 228        0       10   F    FALSE   0.2271275      FALSE
## 229        0        2   M    FALSE   0.3058446      FALSE
## 230        0       14   M     TRUE   0.5139014       TRUE
## 231        0       10   F    FALSE   0.2271275      FALSE
## 232        0       14   F    FALSE   0.2823456      FALSE
## 233        0        4   M    FALSE   0.3376586      FALSE
## 234        0       14   M    FALSE   0.5139014       TRUE
## 235        0        2   M     TRUE   0.3058446      FALSE
## 236        0       18   M    FALSE   0.5859787       TRUE
## 237        0       10   M    FALSE   0.4412412      FALSE
## 238        0        4   M     TRUE   0.3376586      FALSE
## 239        0       20   F    FALSE   0.3786606      FALSE
## 240        0        2   F    FALSE   0.1408685      FALSE
## 241        1        0   M     TRUE   0.3633449      FALSE
## 242        0       14   M     TRUE   0.5139014       TRUE
## 243        0        2   M     TRUE   0.3058446      FALSE
## 244        0        0   M    FALSE   0.2757800      FALSE
## 245        0        0   M    FALSE   0.2757800      FALSE
## 246        0        0   M    FALSE   0.2757800      FALSE
## 247        0        0   F    FALSE   0.1241213      FALSE
## 248        0        6   M    FALSE   0.3710132      FALSE
## 249        0        4   M    FALSE   0.3376586      FALSE
## 250        3       16   M     TRUE   0.8046070       TRUE
## 251        1        8   M    FALSE   0.5056539       TRUE
## 252        0        0   M     TRUE   0.2757800      FALSE
## 253        1        0   M     TRUE   0.3633449      FALSE
## 254        0        6   M    FALSE   0.3710132      FALSE
## 255        1        4   M     TRUE   0.4331208      FALSE
## 256        0        0   M    FALSE   0.2757800      FALSE
## 257        0        0   M     TRUE   0.2757800      FALSE
## 258        1        2   M    FALSE   0.3977132      FALSE
## 259        0        6   F    FALSE   0.1799998      FALSE
## 260        0       12   M    FALSE   0.4774520      FALSE
## 261        0        8   M    FALSE   0.4056448      FALSE
## 262        0        0   F    FALSE   0.1241213      FALSE
## 263        0       21   F    FALSE   0.3959664      FALSE
## 264        0        2   M    FALSE   0.3058446      FALSE
## 265        0        1   M    FALSE   0.2905828      FALSE
## 266        0        4   F    FALSE   0.1594640      FALSE
## 267        0        0   F    FALSE   0.1241213      FALSE
## 268        0       13   M     TRUE   0.4956709      FALSE
## 269        0        2   M     TRUE   0.3058446      FALSE
## 270        0        8   F    FALSE   0.2025430      FALSE
## 271        0       10   M    FALSE   0.4412412      FALSE
## 272        0        0   F    FALSE   0.1241213      FALSE
## 273        2       15   F     TRUE   0.4873308      FALSE
## 274        0        4   F    FALSE   0.1594640      FALSE
## 275        0        2   F    FALSE   0.1408685      FALSE
## 276        0        2   M    FALSE   0.3058446      FALSE
## 277        0        2   F    FALSE   0.1408685      FALSE
## 278        0        6   F     TRUE   0.1799998      FALSE
## 279        0       75   F    FALSE   0.9711472       TRUE
## 280        0       22   M     TRUE   0.6545527       TRUE
## 281        0       22   M     TRUE   0.6545527       TRUE
## 282        1       15   F    FALSE   0.3881005      FALSE
## 283        0        8   M    FALSE   0.4056448      FALSE
## 284        0       30   M     TRUE   0.7725216       TRUE
## 285        1       19   M     TRUE   0.6952794       TRUE
## 286        0        1   F    FALSE   0.1322705      FALSE
## 287        0        4   F    FALSE   0.1594640      FALSE
## 288        0        4   F    FALSE   0.1594640      FALSE
## 289        0        4   F    FALSE   0.1594640      FALSE
## 290        0        2   M    FALSE   0.3058446      FALSE
## 291        0        5   F    FALSE   0.1694845      FALSE
## 292        0        6   F    FALSE   0.1799998      FALSE
## 293        0        6   M    FALSE   0.3710132      FALSE
## 294        0        9   M    FALSE   0.4233435      FALSE
## 295        0       11   M     TRUE   0.4592929      FALSE
## 296        0        0   F    FALSE   0.1241213      FALSE
## 297        0        6   F    FALSE   0.1799998      FALSE
## 298        0        8   M    FALSE   0.4056448      FALSE
## 299        0        4   M    FALSE   0.3376586      FALSE
## 300        0        0   F     TRUE   0.1241213      FALSE
## 301        0       10   F    FALSE   0.2271275      FALSE
## 302        0        0   F    FALSE   0.1241213      FALSE
## 303        0        5   M    FALSE   0.3541586      FALSE
## 304        0       14   F    FALSE   0.2823456      FALSE
## 305        0        0   M    FALSE   0.2757800      FALSE
## 306        0        0   F    FALSE   0.1241213      FALSE
## 307        0        0   F    FALSE   0.1241213      FALSE
## 308        0        0   M    FALSE   0.2757800      FALSE
## 309        1        0   M    FALSE   0.3633449      FALSE
## 310        0        0   F    FALSE   0.1241213      FALSE
## 311        0        9   F    FALSE   0.2145795      FALSE
## 312        0        0   F     TRUE   0.1241213      FALSE
## 313        0        2   F     TRUE   0.1408685      FALSE
## 314        0       23   F    FALSE   0.4313299      FALSE
## 315        0       12   F    FALSE   0.2537465      FALSE
## 316        0        3   F    FALSE   0.1499290      FALSE
## 317        0        1   F     TRUE   0.1322705      FALSE
## 318        0        0   F     TRUE   0.1241213      FALSE
## 319        0        3   M    FALSE   0.3215447      FALSE
## 320        0        3   M     TRUE   0.3215447      FALSE
## 321        0        8   M     TRUE   0.4056448      FALSE
## 322        0        7   F    FALSE   0.1910175      FALSE
## 323        0        7   F    FALSE   0.1910175      FALSE
## 324        0        4   F    FALSE   0.1594640      FALSE
## 325        0        2   M     TRUE   0.3058446      FALSE
## 326        0        7   F    FALSE   0.1910175      FALSE
## 327        0        0   F    FALSE   0.1241213      FALSE
## 328        0        0   F    FALSE   0.1241213      FALSE
## 329        0        0   F    FALSE   0.1241213      FALSE
## 330        0       16   F    FALSE   0.3128168      FALSE
## 331        0        0   F     TRUE   0.1241213      FALSE
## 332        0        7   F    FALSE   0.1910175      FALSE
## 333        0        4   F     TRUE   0.1594640      FALSE
## 334        1        0   M    FALSE   0.3633449      FALSE
## 335        1        0   M    FALSE   0.3633449      FALSE
## 336        0       11   M    FALSE   0.4592929      FALSE
## 337        1        0   F    FALSE   0.1751799      FALSE
## 338        0        4   F    FALSE   0.1594640      FALSE
## 339        0        7   F     TRUE   0.1910175      FALSE
## 340        0        9   M    FALSE   0.4233435      FALSE
## 341        0        0   M     TRUE   0.2757800      FALSE
## 342        0        0   F    FALSE   0.1241213      FALSE
## 343        1       10   M     TRUE   0.5420231       TRUE
## 344        3        8   M     TRUE   0.6967457       TRUE
## 345        0        2   M     TRUE   0.3058446      FALSE
## 346        1        7   M     TRUE   0.4874227      FALSE
## 347        1        4   M     TRUE   0.4331208      FALSE
## 348        0        4   M    FALSE   0.3376586      FALSE
## 349        0        0   F    FALSE   0.1241213      FALSE
## 350        0        4   F    FALSE   0.1594640      FALSE
## 351        0        2   F    FALSE   0.1408685      FALSE
## 352        0        4   M    FALSE   0.3376586      FALSE
## 353        0        0   F    FALSE   0.1241213      FALSE
## 354        0        0   F    FALSE   0.1241213      FALSE
## 355        0        0   F    FALSE   0.1241213      FALSE
## 356        0        0   F    FALSE   0.1241213      FALSE
## 357        0        4   M     TRUE   0.3376586      FALSE
## 358        0        0   M    FALSE   0.2757800      FALSE
## 359        1        0   F    FALSE   0.1751799      FALSE
## 360        0        0   F    FALSE   0.1241213      FALSE
## 361        0       10   F     TRUE   0.2271275      FALSE
## 362        0        3   M     TRUE   0.3215447      FALSE
## 363        0        8   F    FALSE   0.2025430      FALSE
## 364        0       14   F    FALSE   0.2823456      FALSE
## 365        0        0   F    FALSE   0.1241213      FALSE
## 366        0        2   F    FALSE   0.1408685      FALSE
## 367        0        4   F     TRUE   0.1594640      FALSE
## 368        0        0   F    FALSE   0.1241213      FALSE
## 369        0       17   F     TRUE   0.3287053      FALSE
## 370        0        4   M     TRUE   0.3376586      FALSE
## 371        0        5   M    FALSE   0.3541586      FALSE
## 372        0        2   M    FALSE   0.3058446      FALSE
## 373        1        0   M    FALSE   0.3633449      FALSE
## 374        1       14   M     TRUE   0.6130701       TRUE
## 375        0        2   F    FALSE   0.1408685      FALSE
## 376        0        7   F    FALSE   0.1910175      FALSE
## 377        1        0   F    FALSE   0.1751799      FALSE
## 378        0        0   F    FALSE   0.1241213      FALSE
## 379        1        0   F    FALSE   0.1751799      FALSE
## 380        1        0   F    FALSE   0.1751799      FALSE
## 381        0        3   M     TRUE   0.3215447      FALSE
## 382        0        0   M     TRUE   0.2757800      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   12
##    TRUE     86   26
library(dplyr)
library(ggplot2)
galc <- ggplot(alc, aes(x = age, y = high_use, col = prediction))
galc + geom_point()

galc <- ggplot(alc, aes(x = failures, y = high_use, col = prediction))
galc + geom_point()

galc <- ggplot(alc, aes(x = absences, y = high_use, col = prediction))
galc + geom_point()

galc <- ggplot(alc, aes(x = sex, y = high_use, col = prediction))
galc + geom_point()
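Because high_use is a logical variable, the points in the plots above pile up on two horizontal lines. A hedged alternative is geom_jitter, which spreads overlapping points slightly; a minimal sketch with hypothetical toy data standing in for alc (assuming ggplot2 is installed):

```r
library(ggplot2)

# Hypothetical toy data with the same structure as alc:
toy <- data.frame(
  age        = c(15, 16, 17, 18, 16, 17),
  high_use   = c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  prediction = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# geom_jitter adds small random noise so overlapping observations stay visible
p <- ggplot(toy, aes(x = age, y = high_use, col = prediction)) +
  geom_jitter(width = 0.2, height = 0.1)
```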

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67539267 0.03141361 0.70680628
##    TRUE  0.22513089 0.06806283 0.29319372
##    Sum   0.90052356 0.09947644 1.00000000
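The share of misclassified observations (the training error) can be read from this table, or computed directly by comparing the two columns; a minimal base-R sketch with hypothetical stand-in vectors for alc$high_use and alc$prediction:

```r
# Hypothetical stand-ins for alc$high_use and alc$prediction:
high_use   <- c(FALSE, FALSE, TRUE,  TRUE,  FALSE, TRUE)
prediction <- c(FALSE, TRUE,  TRUE,  FALSE, FALSE, FALSE)

# Training error = proportion of observations where the prediction is wrong
training_error <- mean(high_use != prediction)
training_error  # 3 of the 6 toy observations are misclassified -> 0.5
```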

10-fold cross-validation

Comments: the 10-fold cross-validation error of my model (0.267) is not lower than, but roughly equal to, that of the model introduced in DataCamp. I chose age, failures, absences and sex as the predictors, but age has no statistically significant relationship with high/low alcohol use, so my model uses more variables than the DataCamp model without improving the prediction error.

library(boot)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when the predicted probability is on the wrong side of 0.5
  mean(n_wrong)                       # proportion of wrong predictions
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2565445
cv <- cv.glm(data = alc, cost = loss_func, glmfit = malc, K = 10)
cv$delta[1]
## [1] 0.2670157
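cv.glm() works by splitting the data into K folds, refitting the model on K-1 of them and measuring the prediction error on the held-out fold. A minimal base-R sketch of the same idea, using the built-in mtcars data and a hypothetical stand-in model (am ~ wt) rather than the alc model:

```r
set.seed(1)
K <- 5
# assign each row of mtcars to one of K folds
folds <- sample(rep(1:K, length.out = nrow(mtcars)))

errors <- sapply(1:K, function(k) {
  fit  <- glm(am ~ wt, data = mtcars[folds != k, ], family = binomial)
  prob <- predict(fit, newdata = mtcars[folds == k, ], type = "response")
  mean((prob > 0.5) != mtcars$am[folds == k])  # misclassification rate in fold k
})
mean(errors)  # cross-validated prediction error, averaged over the K folds
```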

Chapter 4 Clustering and classification

1.Load the Boston data

(Comments: The Boston data describes housing values in the suburbs of Boston. It has 506 rows and 14 columns: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv.)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

2.The graphical overview of the data and the summaries of the variables

(Interpretation: The summary shows the minimum, maximum and mean of each variable (min/max/mean): crim (0.006/88.976/3.614), zn (0.00/100/11.36), indus (0.46/27.74/11.14), chas (0.00/1.00/0.07), nox (0.385/0.871/0.555), rm (3.56/8.78/6.28), age (2.9/100/68.57), dis (1.13/12.13/3.80), rad (1.00/24.00/9.55), tax (187.0/711.0/408.2), ptratio (12.60/22.00/18.46), black (0.32/396.90/356.67), lstat (1.73/37.97/12.65) and medv (5.00/50/22.53). The crime rate thus varies enormously from its minimum to its maximum, and most of the other variables show a similarly wide range. The absolute correlations between pairs of variables range from about 0.01 (chas and rad) to 0.91 (rad and tax).)

(1)summary

summary (Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

(2)correlation and graph

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
cor_matrix<-cor(Boston) %>% round(digits=2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

pairs(Boston, col = "blue", pch = 18, main = "Matrix plot of the variables")

3. Standardize the dataset, print out a summary of it, and create a categorical variable of the crime rate

(After standardization each variable has mean 0 and standard deviation 1, so the Min., 1st Qu., Median, 3rd Qu. and Max. all change accordingly. For example, the maximum of the scaled crim is about 9.92.)
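What scale() does to each column can be checked on a toy vector: it subtracts the column mean and divides by the standard deviation.

```r
x <- c(2, 4, 6, 8)
scaled_x <- as.numeric(scale(x))

# scale() centers (subtract the mean) and rescales (divide by the sd)
all.equal(scaled_x, (x - mean(x)) / sd(x))  # TRUE
```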

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,] # the remaining 20% of the rows form the test set
correct_classes <- test$crime # save crime
test <- dplyr::select(test, -crime) # remove crime
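The quantile/cut step above can be illustrated on a toy vector: quantile() gives the break points and cut() assigns each value to one of the four quartile-based categories.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(x)  # the 0%, 25%, 50%, 75% and 100% quantiles of x
categories <- cut(x, breaks = bins, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
table(categories)  # two observations fall into each quartile-based class
```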

4. Fit the linear discriminant analysis on the train set

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2351485 0.2574257 0.2698020 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.87299230 -0.9006928 -0.10828322 -0.8630622  0.39050574 -0.8745878
## med_low  -0.09502722 -0.3875725 -0.06511327 -0.5947407 -0.08592337 -0.3745933
## med_high -0.38746839  0.2354597  0.21980846  0.4237487  0.08026343  0.4903567
## high     -0.48724019  1.0169738 -0.09172814  1.0348811 -0.41126902  0.8278190
##                 dis        rad        tax     ptratio      black       lstat
## low       0.9138336 -0.6708280 -0.7405495 -0.41971430  0.3758943 -0.76785105
## med_low   0.3814662 -0.5514983 -0.5317292 -0.04850343  0.3179065 -0.21242752
## med_high -0.3953277 -0.3999076 -0.2801063 -0.25171842  0.1186751  0.03292365
## high     -0.8546954  1.6395837  1.5150965  0.78247128 -0.8255074  0.95959586
##                 medv
## low       0.46544383
## med_low   0.02847967
## med_high  0.13945458
## high     -0.72098471
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.11613829  0.622531385 -1.15202249
## indus    0.04242438 -0.301022944  0.05532421
## chas    -0.08955225 -0.094857305 -0.01903477
## nox      0.34569006 -0.729163635 -1.19889267
## rm      -0.06643385 -0.127245998 -0.11391186
## age      0.36023137 -0.468894605 -0.18799369
## dis     -0.01912492 -0.318474962  0.09924561
## rad      2.91665829  1.172768047 -0.20016571
## tax      0.06734470 -0.251947241  0.58686009
## ptratio  0.13333119 -0.031444649 -0.24826103
## black   -0.12506004 -0.009150835  0.07691901
## lstat    0.23053639 -0.161797914  0.41515853
## medv     0.18690846 -0.359917161 -0.18088624
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9448 0.0438 0.0114
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

5. Predict the classes with LDA

(Comments: The cross-tabulation compares the correct and predicted crime categories (low, med_low, med_high and high). The diagonal gives the correctly classified observations: 72 low, 72 med_low, 88 med_high and 126 high. The high category is predicted almost perfectly, while low and med_low are frequently confused with each other.)

(1)Predict the classes with the LDA model

lda.pred <- predict(lda.fit, newdata = test)

(2)cross tabulate the results

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       72      50        5    0
##   med_low   25      72       29    0
##   med_high   3      29       88    6
##   high       0       0        1  126
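From the cross-tabulation, the proportion of correctly classified observations is the sum of the diagonal divided by the total. Reconstructing the table above as a plain matrix:

```r
# The cross-tabulation above, entered as a matrix (rows = correct, cols = predicted)
tab <- matrix(c(72, 50,  5,   0,
                25, 72, 29,   0,
                 3, 29, 88,   6,
                 0,  0,  1, 126),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
sum(diag(tab)) / sum(tab)  # about 0.71 of the observations are classified correctly
```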

6. Reload the Boston dataset, standardize it, calculate the distances between the observations and run k-means

Interpretation: the summary of the (Manhattan) distances shows a minimum of 0.27, a median of 12.61, a mean of 13.55 and a maximum of 48.86. Judging by the total within-cluster sum of squares, the optimal number of clusters is 2, so I run the algorithm again with centers = 2.

(1)Reload and standardize the Boston dataset

library(MASS)
data("Boston")
summary(Boston)
## (output identical to the summary shown in section 2 above)
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

(2)Calculate the distance between the variables

dist_boston <- dist(boston_scaled, method = 'manhattan')
summary(dist_boston)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
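dist() supports several metrics; on a toy pair of points the difference between the Euclidean and the Manhattan distance is easy to see:

```r
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)  # two points: (0, 0) and (3, 4)

as.numeric(dist(m, method = "euclidean"))  # sqrt(3^2 + 4^2) = 5
as.numeric(dist(m, method = "manhattan"))  # |3| + |4| = 7
```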

(3)Run k-means algorithm

km <- kmeans(boston_scaled, centers = 3)

(4)Investigate the optimal number of clusters and run the algorithm again

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

(5)Visualize the total within-cluster sum of squares, choose the optimal number of clusters and run k-means again

library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

Bonus: Perform the k-means with >2 clusters

km2 <-kmeans(boston_scaled, centers = 4)
pairs(boston_scaled, col = km2$cluster)

Super bonus

model_predictors <- dplyr::select(train, -crime)

check the dimensions

dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3

matrix multiplication

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime) # the same plot, with points colored by the crime classes of the train set

Chapter 5 Dimensionality reduction techniques

Show a graphical overview of the data and summaries of the variables

human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep  =",", header = T)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(human)

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(corrplot)
library(dplyr)
cor(human) %>% corrplot

Comments: The graphical overview and summary show that there are 8 variables: Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth and Parli.F. In the correlation plot, red circles indicate negative correlations and blue circles positive ones; the bigger and darker the circle, the stronger the correlation. For example, Edu.Exp and Life.Exp are strongly positively correlated, while Mat.Mor and Life.Exp are negatively correlated. The distributions of the variables vary in shape: some are left-skewed, some right-skewed and some roughly symmetric. The mean of GNI is 17628; the other means can be read from the summary output above.

perform principal component analysis on the unstandardized human data, show the variability captured by the components and draw a biplot

pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


standardize the data and repeat the above analysis

Interpretation of the unstandardized and standardized results: As shown in the figures, the two analyses give clearly different results. PCA maximizes variance, so variables measured on large scales (here GNI, whose values run into the thousands) dominate the principal components of the unstandardized data. Standardizing the variables to zero mean and unit variance gives each of them equal weight, so the standardized PCA reflects the correlation structure of the data rather than the measurement scales.
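
As a side note, the standardization can also be done inside prcomp itself; a minimal sketch (assuming human is loaded as above) showing that the two routes agree:

```r
# prcomp can center and scale internally; this should match scaling first
pca_manual <- prcomp(scale(human))
pca_internal <- prcomp(human, center = TRUE, scale. = TRUE)
# the component standard deviations agree
all.equal(pca_manual$sdev, pca_internal$sdev)
```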

human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human_stan <- prcomp(human_std)
biplot(pca_human_stan, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

Interpretation of the plots: the biplot of the unstandardized PCA shows arrows pointing essentially along PC1 only, so all variables appear to contribute to PC1 (the zero-length arrow warnings indicate that most arrows are too short to draw, because the large-scale variable GNI dominates the unstandardized analysis). In the biplot of the standardized PCA, Parli.F and Labo.FM point along PC2, so these two variables contribute to PC2, while the other variables contribute to PC1. The human development index combines several criteria, including life expectancy at birth, expected years of education, etc.; expected education does not relate strongly to PC2.
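
The variability captured by the components, asked for in the task above, can be read from summary(); a sketch assuming pca_human and pca_human_stan exist as created in this chapter:

```r
# proportion of variance (%) explained by each principal component
round(100 * summary(pca_human)$importance[2, ], digits = 1)
round(100 * summary(pca_human_stan)$importance[2, ], digits = 1)
```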

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.

The first principal component is correlated with Edu.Exp, Life.Exp, GNI, Edu2.FM, Mat.Mor and Ado.Birth, while the second is correlated with Parli.F and Labo.FM. The angle between two arrows reflects the correlation between the variables: the angle between the percentage of women in parliament (Parli.F) and the female labour force participation rate (Labo.FM) is small, so they are strongly positively correlated. The angle between Edu.Exp and Parli.F is much larger, so those two variables are only weakly correlated.
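
The angle-based reading can be cross-checked against the plain correlations; a sketch assuming human is loaded as above:

```r
# a small angle between arrows should mean a high positive correlation
cor(human$Parli.F, human$Labo.FM)   # expected to be clearly positive
cor(human$Edu.Exp, human$Parli.F)   # expected to be weaker
```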

Load the tea dataset, look at its structure and dimensions, visualize it, run MCA and draw a biplot

library(FactoMineR)
data(tea) #load tea
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
library(tidyr)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) # visualize tea
## Warning: attributes are not identical across measure variables;
## they will be dropped

mca_tea <- MCA(tea_time, graph = FALSE) #run MCA
summary(mca_tea) # show result of MCA
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca_tea, invisible=c("ind"), habillage = "quali") # visualize MCA

Interpret the result of MCA: Apart from "alone" and "other", the category coordinates on Dim 1 are all significantly different from zero (v.test below -1.96 or above 1.96), so the null hypothesis of a zero coordinate is rejected for them; the same reading applies to Dim 2 and Dim 3. Among the categorical variables, "how" and "where" have the largest eta2 values on Dim 1 (0.708 and 0.702), indicating the strongest links between these variables and Dim 1. Dim 2 and Dim 3 can be interpreted in the same way.
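
FactoMineR also provides helpers for reading the dimensions directly; a sketch assuming mca_tea exists as above:

```r
# eigenvalues and percentage of variance per dimension
mca_tea$eig
# automatic description of which categories are linked to dimension 1
dimdesc(mca_tea, axes = 1)
```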

Interpret the biplot: The distance between variable categories is a measure of their similarity. For example, "alone" and "unpackaged" plot close together and are therefore similar, as do "No.sugar" and "Earl Grey", while "alone" is far from "lemon" and thus dissimilar to it.


Chapter 6 Analysis of longitudinal data

Analysis of chapter 8 of MABS using RATS dataset

Load the RATS and make it to RATSL

Interpretation: The RATS data is in wide form and has to be converted to long form. The commands below convert RATS (wide form) to RATSL (long form); long-form data is often required for advanced statistical analysis and graphing. The data is then ready for the analysis.
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
library(dplyr)
library(tidyr)
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <-  RATS %>% gather(key = WD, value = Weight, -ID, -Group)
RATSL <-  RATSL %>% mutate(Time = as.integer(substr(WD,3,4)))
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
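
The same reshape can be written with tidyr's newer pivot_longer(), which replaces gather(); a sketch assuming RATS is loaded as above (tidyr >= 1.0):

```r
library(dplyr)
library(tidyr)
# wide to long: one row per rat per measurement day
RATSL2 <- RATS %>%
  pivot_longer(cols = -c(ID, Group), names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(substr(WD, 3, 4)))
```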

Draw plots of RATSL

Interpretation: The graphs show that the weights of the group 1 rats differ clearly from those of groups 2 and 3: group 1 has the smallest weights of the three groups. Group 2 contains the single biggest rat, but its other rats are smaller than those in group 3. Group 1 also contains more rats (8) than groups 2 and 3 (4 each).
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

Standardise Weight and draw the plot again

Interpretation: The weights of the rats in groups 1, 2 and 3 are now standardized within each time point. In the standardized plot, the heaviest rat in group 2 no longer rises the way it did in the plot of the unstandardized data, while the lighter rats change little; the standardization therefore affects the high-weight profiles more visibly than the low-weight ones.
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1,...
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, ...
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "...
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 55...
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881900...
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized Weight")

Summary graphs

Interpretation: First, RATSS is created with the mean and standard error of Weight by group and time; glimpse shows that it has 33 rows and 4 columns: Group, Time, mean and se.
The plot of the mean profiles suggests that the mean weights of the three groups differ from each other, since the three profiles do not overlap. We could accordingly label group 1 the low-weight group, group 2 the moderate-weight group and group 3 the high-weight group.
library(dplyr)
library(tidyr)
library(ggplot2)
n <- RATSL$Time %>% unique() %>% length() # number of measurement occasions
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight) / sqrt(n)) %>%
  ungroup() # summary data with mean and standard error of Weight by Group and day
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSS) # glimpse the data
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3...
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) + geom_line() +
  scale_linetype_manual(values = c(1,2,3)) + geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"),    width=0.3) +  theme(legend.position = c(0.8,0.8)) +
  theme_bw() + theme(panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank()) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)") # plot the mean profiles

Find the outliers

Interpretation: First, RATSL8S is created; glimpse shows that it has 16 rows and 3 columns: Group, ID and mean.
The boxplot shows that the rats in group 1 have the lowest mean weight of the three groups, group 2 is higher than group 1, and group 3 has the highest mean weight. Each group contains an outlier.
RATSL8S <- RATSL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup() # create a summary by Group and WD with mean
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSL8S) # glimpse RATSL8S
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 2...
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 1-64") # draw a boxplot of mean versus Group

filter out the outliers and draw the plot again

Interpretation: The outliers are now removed, as can also be seen in the plot. Otherwise the picture is the same as before: group 3 has the highest mean weight, then group 2, and group 1 has the lowest.
RATSL8S1 <- RATSL8S %>%  filter(ID != 2 & ID != 12 & ID != 13) # create a new data by filtering the outlier
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 1-64") # draw the plot again

Performing t-test and ANOVA

Interpretation: The result shows a significant difference between groups 1, 2 and 3 in terms of mean rat weight, since the p value is 2.721e-14, well below 0.05.
oneway.test(mean ~ Group, data = RATSL8S1, var.equal = TRUE) # run Anova
## 
##  One-way analysis of means
## 
## data:  mean and Group
## F = 2577.4, num df = 2, denom df = 10, p-value = 2.721e-14
fit <- lm(mean ~ Group, data = RATSL8S1) # run linear model
anova(fit) # compute the analysis of variance table in order to fitted model with anova
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 175958   87979  2577.4 2.721e-14 ***
## Residuals 10    341      34                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
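
The section title also mentions a t-test; with three groups, pairwise two-sample t-tests are one way to see which pairs of groups differ. A sketch assuming RATSL8S1 exists as above (pairwise.t.test is in base R's stats package):

```r
# pairwise comparisons of the group mean weights, Bonferroni-adjusted
pairwise.t.test(RATSL8S1$mean, RATSL8S1$Group, p.adjust.method = "bonferroni")
```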

Analysis Chapter 9 of MABS using BPRS dataset

Load the BPRS and make it to BPRSL

Interpretation: The BPRS data is in wide form and has to be converted to long form. The commands below convert BPRS (wide form) to BPRSL (long form), after which it is ready for the analysis.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment) #convert BPRS treatment
BPRS$subject <- factor(BPRS$subject) # convert BPRS subject
library(tidyr)
library(dplyr)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject) # convert BPRS to long form
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,5))) # add week to BRPS
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Draw a bprs against week plot

Interpretation: First, the dimensions show that there are 360 rows and 5 columns.
The plot shows the bprs of all 40 men against time, with each observation labelled by its treatment group. The profiles of the two treatment groups overlap substantially and appear randomly intermixed.
dim(BPRSL) # look at the dimension
## [1] 360   5
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_text(aes(label = treatment)) +
  scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top") # draw the plot

The linear model

Interpretation: In this linear regression model, bprs is the response variable and week and treatment are the explanatory variables. Treatment group 1 is the baseline; the fitted intercept for treatment group 2 is 46.4539 + 0.5722 ≈ 47.0261. There is no significant difference between the two treatment groups, since the p value of treatment2 is 0.661, which is greater than 0.05, whereas week has a clearly significant negative effect on bprs.
BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL) # create a regression model
summary(BPRSL_reg) # print out the model summary
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Random Intercept model

Interpretation: In this model, bprs is the response variable, and week and treatment are the explanatory variables, with a random intercept for each subject. The AIC of the model is 2748.7, the BIC 2768.1, the logLik -1369.4, and df.resid 355. The scaled residuals range from -3.0481 to 3.4855. For the random effects, the variance of the subject intercept is 47.41 (Std.Dev 6.885), which indicates considerable between-subject differences. The t value of week in the fixed effects is -10.896, large in magnitude, which is strong evidence that bprs decreases over the weeks. The correlation of the fixed effects between week and treatment2 is 0.000.
library(lme4) # load lme4
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE) # run random intercept model
summary(BPRS_ref) # print out summary
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Random intercept and random slope model

Interpretation: The result shows that the fixed effects are quite similar to those of the random intercept model. The random intercept and random slope model offers a better fit to the data, as its AIC is smaller. The likelihood ratio test gives a chi-squared statistic of 7.27 on 2 degrees of freedom with a p value of 0.026, less than 0.05, so the improvement in fit is significant.
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE) # create a random intercept and random slope model
summary(BPRS_ref1) # print out the summary
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref1, BPRS_ref) # run anova
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random intercept and random slope model with interaction

Interpretation: The fixed effects and the AIC are quite similar to those of the previous model. The likelihood ratio test gives a chi-squared statistic of 3.17 on 1 degree of freedom with a p value of 0.075, greater than 0.05, so the interaction does not significantly improve the fit.
The two plots show the observed bprs profiles and the fitted bprs profiles from the interaction model: the first shows the observed data, and the second illustrates how well the model fits those observations.
BPRS_ref2 <- lmer(bprs ~ + week * treatment + (week | subject ), data = BPRSL, REML = FALSE) # create a random intercept and random slope model with the interaction
summary(BPRS_ref2) # print a summary of the model
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ +week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1) # run anova
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ +week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(ggplot2)
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, color = interaction(subject, treatment))) 
p2 <- p1 + geom_line()  + geom_point()
p3 <- p2 + scale_x_continuous(name = "week")
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "none")
p6 <- p5 + theme(panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7    # draw a plot
Fitted <- fitted(BPRS_ref2) # create a vector of the fitted values from the interaction model
BPRSL <- BPRSL %>%
  mutate(Fitted) # add the fitted values as a new column of BPRSL
p1 <- ggplot(BPRSL, aes(x = week, y = Fitted, color = interaction(subject, treatment))) 
p2 <- p1 + geom_line()  + geom_point()
p3 <- p2 + scale_x_continuous(name = "week")
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "none")
p6 <- p5 + theme(panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Fitted")
graph2 <- p7  

graph1; graph2